ABSTRACT

In the past decade a number of methods have been developed for
selecting the "best" or at least a "good" subset of variables in regression
analysis. For various reasons, we may be interested in including only a
subset, say of size r < p, where p is the number of independent variables. Various
authors have considered this problem, and a variety of techniques are
presently being used to construct such subsets. Most of them seem to
lack justification in terms of statistical theory.

In  this  paper,  we  are  interested  in  deriving  a selection  procedure 
to  select  a random  size  optimal  subset  such  that  all  inferior  independent 
variables  are  excluded.  Some  results  on  the  efficiency  of  the  procedure 
are  also  discussed. 
In the past decade a number of methods have been developed for
selecting the "best" or at least a "good" subset of variables in regression
analysis. For various reasons, we may be interested in including
only a subset, say of size r < p, where p is the number of independent variables. Various
authors have considered this problem, and a variety of techniques are
presently being used to construct such subsets. They seem to lack
justification by statistical theory (see e.g. [2], [6]).

Arvesen and McCabe [1] propose a procedure for selecting a subset
within a class of subsets with t (fixed) independent variables, taking
into account the statistical variation of the residual sum of squares.

An  algorithm  for  determining  the  necessary  constant  c given  the 
design  matrix  X is  presented  in  [4]. 

In this paper, we are interested in deriving a selection procedure
to select a random size subset excluding all inferior independent
variables (defined later). Some results on the efficiency of the procedure
are also discussed. It should be pointed out that our approach is different
from Arvesen and McCabe [1] and the approaches used by others.


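Let $\pi_0, \pi_1, \ldots, \pi_k$ denote k+1 normal populations with variances
$\sigma_0^2, \sigma_1^2, \ldots, \sigma_k^2$. Let

$$\sigma_{[1]}^2 \le \cdots \le \sigma_{[k]}^2$$

denote the ordered variances. A population $\pi_i$ is said to be

superior (or good) if $\sigma_0^2 \ge \delta_1^* \sigma_i^2$,

inferior (or bad) if $\sigma_0^2 \le \delta_2^* \sigma_i^2$,

where $\delta_1^*$, $\delta_2^*$ are specified constants such that $0 < \delta_2^* < \delta_1^* < 1$.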
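The two definitions can be illustrated with a short Python sketch. It assumes the reading "superior if $\sigma_0^2 \ge \delta_1^*\sigma_i^2$, inferior if $\sigma_0^2 \le \delta_2^*\sigma_i^2$" with $0 < \delta_2^* < \delta_1^* < 1$; the function name, the label "neither" for populations in between, and the numerical values are hypothetical and chosen only for illustration.

```python
def classify_populations(variances, sigma0_sq, delta1, delta2):
    """Label each variance as 'superior', 'inferior', or 'neither'.

    Assumed reading of the definitions (not confirmed by the source):
      superior if sigma0_sq >= delta1 * sigma_i^2
      inferior if sigma0_sq <= delta2 * sigma_i^2
    with 0 < delta2 < delta1 < 1.
    """
    if not 0 < delta2 < delta1 < 1:
        raise ValueError("expected 0 < delta2 < delta1 < 1")
    labels = []
    for s in variances:
        if sigma0_sq >= delta1 * s:
            labels.append("superior")
        elif sigma0_sq <= delta2 * s:
            labels.append("inferior")
        else:
            labels.append("neither")
    return labels


# Hypothetical variances for four populations, with sigma_0^2 = 1.
labels = classify_populations([1.0, 1.2, 2.0, 5.0],
                              sigma0_sq=1.0, delta1=0.8, delta2=0.4)
```

Under these assumed constants a population is superior when its variance is at most 1.25 and inferior when it is at least 2.5, so there is an indifference zone between the two classes.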

We are interested in devising a procedure which selects a random
size subset that excludes all the inferior populations with
probability not less than P*, a specified constant.

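Let $\Omega$ be the parameter space, which is the collection of all possible
parameter vectors $\theta = (\sigma_0^2, \sigma_1^2, \ldots, \sigma_k^2)$. Let $t_1$ and $t_2$ denote, respectively,
the unknown number of inferior and superior populations in the given
collection of k+1 populations. We have $t_1 \ge 0$, $t_2 \ge 1$ and $t_1 + t_2 \le k+1$.

For specified $\delta_1^*$ and $\delta_2^*$, let $\Omega(t_1, t_2)$ denote the set of $\theta$ with exactly
$t_1$ inferior and $t_2$ superior populations, so that in particular

$$\sigma_{[k-t_1]}^2 < \sigma_0^2/\delta_2^* \le \sigma_{[k-t_1+1]}^2 \le \cdots \le \sigma_{[k]}^2 .$$

Then

$$\Omega = \bigcup_{t_1, t_2} \Omega(t_1, t_2).$$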

Let  CD  stand  for  a correct  decision  which  is  defined  to  be  selection 
of  the  subset  which  excludes  all  the  inferior  populations. 

Assume  the  following  standard  linear  model 
$$Y = X\beta + \epsilon, \qquad X = (\mathbf{1}, X_1, \ldots, X_{p-1}), \quad \beta' = (\beta_0, \beta_1, \ldots, \beta_{p-1}), \tag{1}$$

where $X$ is an $N \times p$ known matrix of rank $p \le N$, $\beta$ is a $p \times 1$ parameter
vector, $\epsilon \sim N(0, \sigma_0^2 I_N)$, and $\mathbf{1}' = (1, 1, \ldots, 1)$.

In what follows, (1), which has k (= p-1) independent variables, will
be viewed as the "true" model.

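Consider the models

$$Y = X_{(i)}\beta_{(i)} + \epsilon_i, \tag{2}$$

where $X_{(i)} = (\mathbf{1}, X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_{p-1})$, $\beta_{(i)}' = (\beta_0, \ldots, \beta_{i-1}, \beta_{i+1}, \ldots, \beta_{p-1})$,
and $\epsilon_i \sim N(0, \sigma_i^2 I_N)$, $i = 1, \ldots, k$. $X_{(i)}$ associated with model (2) is called
population $\pi_i$ ($1 \le i \le k$). The goal is to reject the populations (equivalently, the
corresponding $X$'s) associated with $\sigma_{[j]}^2$, $j = k-t_1+1, \ldots, k$, for any fixed $t_1$.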

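Note that

$$SS_i = Y'\{I - X_{(i)}(X_{(i)}'X_{(i)})^{-1}X_{(i)}'\}Y = Y'Q_iY,$$

where $Q_i = I - X_{(i)}(X_{(i)}'X_{(i)})^{-1}X_{(i)}'$; then following Searle [5, p. 57],

$$SS_i/\sigma_0^2 \sim \chi^{2\prime}\{r(Q_i),\ (X\beta)'Q_i(X\beta)/(2\sigma_0^2)\},$$

where $r(Q_i) = N-k = \nu$. Note that the noncentrality parameter, in
general, is not zero and that

$$E(SS_i) = \nu\sigma_i^2 \quad (i = 1, \ldots, k), \qquad \sigma_i^2 = \sigma_0^2 + (X\beta)'Q_i(X\beta)/\nu.$$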
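As a rough numerical illustration of these quantities, the following Python sketch builds $Q_i$, $SS_i = Y'Q_iY$ and $\sigma_i^2 = \sigma_0^2 + (X\beta)'Q_i(X\beta)/\nu$ for a simulated design. It assumes that $Q_i$ is the residual projector of the design with the $i$-th independent variable removed; the sample size, the coefficients and the helper name `residual_projector` are hypothetical.

```python
import numpy as np


def residual_projector(X):
    """Return Q = I - X (X'X)^{-1} X', the projector onto the residual space of X."""
    N = X.shape[0]
    return np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)


rng = np.random.default_rng(0)
N, k = 30, 3                                   # hypothetical sizes, p = k + 1
X = np.column_stack([np.ones(N), rng.normal(size=(N, k))])
beta = np.array([1.0, 2.0, 0.0, -1.0])         # hypothetical true coefficients
sigma0_sq = 1.0
Y = X @ beta + rng.normal(scale=np.sqrt(sigma0_sq), size=N)

nu = N - k                                     # rank of each Q_i
mean = X @ beta
SS, sigma_sq = [], []
for i in range(1, k + 1):
    X_i = np.delete(X, i, axis=1)              # drop the i-th independent variable
    Q_i = residual_projector(X_i)
    SS.append(float(Y @ Q_i @ Y))              # SS_i = Y' Q_i Y
    sigma_sq.append(sigma0_sq + float(mean @ Q_i @ mean) / nu)
```

In this toy example the second coefficient is zero, so dropping the second variable leaves the mean inside the column space of the reduced design and $\sigma_2^2$ equals $\sigma_0^2$; dropping a variable with a nonzero coefficient inflates $\sigma_i^2$.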
Assume that $\sigma_0^2$ is known. Since the problem is invariant with respect
to the scaling by $\sigma_0^2 > 0$, we assume without loss of generality that
$\sigma_0^2 = 1$.

To obtain the joint distribution of $SS_1, \ldots, SS_k$, we can write

$$Y'Q_iY = U_i'U_i,$$

where

$$U_i = B_iY \quad\text{and}\quad B_iB_i' = I_\nu, \quad B_i'B_i = Q_i; \tag{4}$$

$B_i$ is a $\nu \times N$ matrix.
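One way to obtain a matrix $B_i$ with the properties in (4) is through the eigenvectors of the projector $Q_i$ that belong to eigenvalue 1. The sketch below is only one possible construction (the factor is unique only up to an orthogonal rotation), and the function name `isometry_factor` and the toy design are hypothetical.

```python
import numpy as np


def isometry_factor(Q):
    """Return B (nu x N) with B B' = I_nu and B' B = Q for a symmetric projector Q."""
    Q = (Q + Q.T) / 2.0                        # remove rounding asymmetry
    eigvals, eigvecs = np.linalg.eigh(Q)
    keep = eigvals > 0.5                       # a projector has eigenvalues 0 or 1
    return eigvecs[:, keep].T


rng = np.random.default_rng(0)
N, k = 12, 3                                   # hypothetical sizes
X_i = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])  # k columns
Q_i = np.eye(N) - X_i @ np.linalg.solve(X_i.T @ X_i, X_i.T)
B_i = isometry_factor(Q_i)

Y = rng.normal(size=N)
U_i = B_i @ Y                                  # then Y' Q_i Y = U_i' U_i
```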

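The joint distribution of $U' = (U_1', \ldots, U_k')$ is multivariate
normal in $k\nu$ dimensions with mean vector $\eta' = (\eta_1', \ldots, \eta_k')$, $\eta_i = B_iX\beta$,
and covariance matrix $\Sigma = (\Sigma_{ij})$ where $\Sigma_{ij} = \sigma_i\sigma_jB_iB_j'$. Note that
the $k\nu \times k\nu$ covariance matrix $\Sigma$ is possibly singular. Let $\Sigma = FF'$
where $F$ is of full column rank $r$ ($r = \mathrm{rank}(\Sigma)$), and let $U = \eta + FA$
where $A \sim N(0, I_r)$. Thus, the joint characteristic function of
$SS_1/2, \ldots, SS_k/2$ is (since $SS_i = U_i'U_i$),

$$\begin{aligned}
\mathrm{cf}(t_1, \ldots, t_k) &= E\Big\{\exp\Big(i\sum_{j=1}^{k} t_jU_j'U_j/2\Big)\Big\} \\
&= |I - iF'TF|^{-1/2} \exp\big[\tfrac{1}{2}\eta'\{iT - TF(I - iF'TF)^{-1}F'T\}\eta\big] \\
&= |I - i\Sigma T|^{-1/2} \exp\big\{\tfrac{i}{2}\,\eta'T(I - i\Sigma T)^{-1}\eta\big\},
\end{aligned}$$

where $T = \mathrm{diag}(t_1, \ldots, t_k) \otimes I_\nu$.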
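The last two forms of the characteristic function agree because $|I - iF'TF| = |I - iFF'T|$ and $F(I - iF'TF)^{-1} = (I - i\Sigma T)^{-1}F$. A small numerical check of both identities can be sketched in Python as follows; the dimensions and the random inputs are hypothetical toy values, and the determinants and exponents are compared separately to avoid branch issues with the complex square root.

```python
import numpy as np

rng = np.random.default_rng(1)
k, nu, r = 3, 2, 4                     # toy sizes: Sigma is (k*nu) x (k*nu) with rank r
n = k * nu
F = rng.normal(size=(n, r))
Sigma = F @ F.T                        # singular, since r < n
eta = rng.normal(size=n)
t = rng.normal(size=k)
T = np.kron(np.diag(t), np.eye(nu))    # T = diag(t_1, ..., t_k) kron I_nu

M = F.T @ T @ F

# Determinant factor in the two forms.
det_F = np.linalg.det(np.eye(r) - 1j * M)
det_S = np.linalg.det(np.eye(n) - 1j * Sigma @ T)

# Exponent in the two forms.
inner = 1j * T - T @ F @ np.linalg.solve(np.eye(r) - 1j * M, F.T @ T)
quad_F = 0.5 * (eta @ inner @ eta)
quad_S = 0.5j * (eta @ T @ np.linalg.solve(np.eye(n) - 1j * Sigma @ T, eta))
```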

We  propose  the  rejection  rule  of  the  form: 




R: Reject $\pi_i$ (or reject $X_i$) if and only if


where  < c < 1 . 

Note that $SS_i$ is associated with $\sigma_i^2$ or, equivalently, with
population $\pi_i$ and degrees of freedom $\nu$, whereas $SS_{[i]}$ is the
$i$-th smallest sum of squares and $SS_{(i)}$ is the sum of squares
corresponding to the (unknown) $i$-th smallest expected sum of
squares $\nu\sigma_{[i]}^2$ and degrees of freedom $\nu$. Thus

inf  P.(CD|R)  = inf  P„{  min  SS,,x  > 


-{  min  SS,.x>^} 

- k-t^+l<i<k  0)-62 


= inf  P { min  — ^ -1^  } 

- k-t^+Ui£k 


O^t^^k 


SS 

inf  P{  min  — ^ vc}. 

3 k-t^+Nij<k 


It is clear that the bound in (4) approaches a minimum value
as the parameters $\sigma_{[i]}^2$, $k-t_1+1 \le i \le k$, for any $t_1$, approach $1/\delta_2^*$. Since
this limiting probability does not depend on the value of $\sigma_{[i]}^2$,
$k-t_1+1 \le i \le k$, for any $t_1$, we can assume that they are all equal to $1/\delta_2^*$.



inf  P(CD|R) 


= P{  min 
l<i<k 


Let  Zj  = iiSS.  - V - njDj)/(J)?  Then 


ss. 

p(-^  > 


= P{Z  i<j<k} 

^ /?6*  2 (2v)=^  “■ 


_ - (t)^»  lcj<k}. 

That  is,  the  worst  configuration  (asymptotically)  is  when  6=0. 

From  the  multivariate  central  limit  theorem,  it  follows  that  for 

large  v,  the  joint  distribution  of  Z.j,...,Z|^  does  not  depend  on 

(see  [1]).  Now  the  problem  is  the  same  as  to  compute 

the  joint  distribution  of  SS^ , ,SSj^.  Note  that  here  E = 

B.Bj  is  vxv  as  given  in  (4),  and  =6j"h. 

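Following the discussion in [1], the joint cumulant
generating function of $SS_j/2$, $1 \le j \le k$, is (see [5])

$$-\tfrac{1}{2}\log|I - i\Sigma T| = \tfrac{1}{2}\sum_{r=1}^{\infty} i^r\,\mathrm{tr}(\Sigma T)^r/r = \tfrac{1}{2}\sum_{r=1}^{\infty} i^r c_r(t_1, \ldots, t_k)/r. \tag{5}$$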
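The series form rests on the expansion $-\log|I - A| = \sum_{r \ge 1} \mathrm{tr}(A^r)/r$, which converges when the spectral radius of $A$ is below one. The following Python sketch compares the closed form with a truncated series for $A = i\Sigma T$; the matrix sizes, the rescaling of $\Sigma$ (done only to guarantee convergence) and the truncation at 60 terms are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2)
k, nu = 3, 2                                   # toy sizes
n = k * nu
F = rng.normal(size=(n, 4))
Sigma = F @ F.T
Sigma /= 2 * np.linalg.norm(Sigma, 2)          # spectral norm 0.5, for convergence
t = rng.uniform(-0.5, 0.5, size=k)
T = np.kron(np.diag(t), np.eye(nu))

A = Sigma @ T
# Closed form: -(1/2) log |I - i Sigma T| (principal branch, determinant near 1).
cgf_closed = -0.5 * np.log(np.linalg.det(np.eye(n) - 1j * A))

# Truncated series: (1/2) sum_r i^r tr((Sigma T)^r) / r.
cgf_series = 0.0 + 0.0j
power = np.eye(n)
for r in range(1, 61):
    power = power @ A                          # (Sigma T)^r
    cgf_series += 0.5 * (1j ** r) * np.trace(power) / r
```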
Thus, the joint cumulant $\kappa_{r_1 \cdots r_k}$, of total order $r = r_1 + r_2 + \cdots + r_k$,
can be obtained from the $r$th term of (5) by multiplying the
coefficient of $i^r t_1^{r_1} \cdots t_k^{r_k}$ by $r_1! \cdots r_k!$. Note that for r = 1, 2, 3,


3 = 1'^  KJ 


J=1  ^f3 


-6  ^ V-t 

h<i<j 

Expression (6) would determine an Edgeworth approximation of order
$\nu$ [3]. To compute some constant c to satisfy

(7)  inf  P(CDIR)  = P{Z.  > l<j<k}  = P*, 

J “ 

where $Z_j = (2\nu)^{-1/2}(SS_j - \nu)$, $1 \le j \le k$, and the covariance matrix of the
$\{Z_j\}$ is given by $\Gamma = (\rho_{ij})$, $\rho_{ij} = \nu^{-1}\,\mathrm{tr}(\Sigma_{ij}\Sigma_{ji})$, $i \ne j$.

The Fortran program as in [4] can be modified to compute (7).
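As a rough stand-in for such a computation, the sketch below estimates by Monte Carlo a critical value $d$ with $P\{Z_j \ge -d,\ 1 \le j \le k\} = P^*$ under the limiting multivariate normal distribution. This is not the Fortran program of [4] and it ignores the Edgeworth correction; it assumes $\Sigma_{ij} = B_iB_j'$, in which case $\mathrm{tr}(\Sigma_{ij}\Sigma_{ji}) = \mathrm{tr}(Q_iQ_j)$. The function names, the simulated design and the value of $P^*$ are hypothetical, and the translation of $d$ into the constant $c$ of the rule is not attempted here.

```python
import numpy as np


def residual_projector(X):
    """Return Q = I - X (X'X)^{-1} X'."""
    N = X.shape[0]
    return np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)


def correlation_matrix(Q_list, nu):
    """Gamma with rho_ij = tr(Q_i Q_j) / nu (assumes Sigma_ij = B_i B_j')."""
    k = len(Q_list)
    return np.array([[np.trace(Q_list[i] @ Q_list[j]) / nu for j in range(k)]
                     for i in range(k)])


def critical_value(Gamma, p_star, n_draws=200_000, seed=0):
    """Monte Carlo estimate of d with P{Z_j >= -d for all j} = p_star, Z ~ N(0, Gamma)."""
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(np.zeros(len(Gamma)), Gamma, size=n_draws)
    # min_j Z_j >= -d with probability p_star, so -d is the (1 - p_star) quantile.
    return -np.quantile(Z.min(axis=1), 1.0 - p_star)


rng = np.random.default_rng(0)
N, k = 30, 3                                   # hypothetical design
X = np.column_stack([np.ones(N), rng.normal(size=(N, k))])
Q_list = [residual_projector(np.delete(X, i, axis=1)) for i in range(1, k + 1)]
Gamma = correlation_matrix(Q_list, nu=N - k)
d = critical_value(Gamma, p_star=0.90)
```

Because the projectors $Q_i$ share the residual space of the full design, the correlations in this toy example are close to one, so $d$ is only slightly larger than the univariate normal quantile.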
Note that when $\sigma_0^2$ is unknown, we can use the same method
as above to construct a rule as follows:

SS.  SSq 

R':  Reject  (or  reject  X^)  if  and  only  if  ^ jijip- 
where 6^ < c < 1 and


$$SS_0 = Y'\{I - X(X'X)^{-1}X'\}Y = Y'Q_0Y.$$

Here $SS_0/\sigma_0^2$ has a $\chi^2_{N-p}$ distribution.


Expected  number  of  inferior  populations  included  in  the  selected  subset 
and  its  supremum. 


For the proposed procedure the number $T_1$ of inferior populations
that enter into the selected subset is a random variable. For fixed
values of k and P*, the expected value of $T_1$ is a function of $\theta$.

For $\theta \in \Omega(t_1, t_2)$, and large $\nu$,


k SS,.. 

I < VC} 

i=k-t^+l  0|-.^ 


i=k-t^+l  ^2 


I p{Z(^)  } 

i=k-t,+l  ^ ^ 2v)^ 


i=k-t,+l  1 


q)^,  i<j<k) 


where  and  are  associated  with  Thus  the  worst 

configuration  is  6 = 0.  Hence 


sup  sup 
6 0€n(k,l)  2 


= I p{z  < 

i=l  ^ ^ ^ 


Expected  number  of  superior  populations  that  enter  the  selected  subset 
and  its  infimum. 


Let $T_2$ denote the random number of superior populations that enter
the selected subset. For $\theta \in \Omega(t_1, t_2)$ and for large $\nu$,

*2 

Ee<T2l«)=  ,1, 


SS/M  , 

( i ) 1 , vc 

- ~T~  6*} 


2 ss,.. 

> [ P{ — ^ <_  vC} 


i=l 


’[i] 


'2 

r 

i=l 


Aic 


Hence 


inf  E (T„|R)  = min  inf  inf  E„(T,|R) 

- 6 0€J<t^,t2)  - ^ 


' /2^  ^ 
where


= 


(SS^  - v),  6*SS^  has  chi-square 


with  V degrees  of  freedom. 